Abstract — In this work, we propose a new optimization framework
for multiclass boosting learning. In the literature, AdaBoost.MO
and AdaBoost.ECC are two successful multiclass boosting algorithms
that can use binary weak learners. We explicitly derive the
Lagrange dual problems of these two algorithms from their
regularized loss functions. We show that the Lagrange dual
formulations enable us to design totally corrective multiclass
algorithms by using the primal-dual optimization technique.
Experiments on benchmark data sets suggest that our multiclass
boosting achieves a generalization capability comparable with the
state of the art, while converging much faster than stage-wise
gradient descent boosting. In other words, the new totally
corrective algorithms maximize the margin more aggressively.

Index Terms — Multiclass boosting, totally corrective boosting, 
column generation, convex optimization. 

I. Introduction 

Boosting is a powerful learning technique for improving 
the accuracy of any given classification algorithm. It has been 
attracting much research interest in the machine learning and 
pattern recognition community. Since Viola and Jones applied 
boosting to face detection [1], it has shown great success in 
computer vision, including the applications of object detection 
and tracking [2], [3], generic object recognition [4], image
classification [5] and retrieval [6]. 

The essential idea of boosting is to find a combination of 
weak hypotheses generated by a base learning oracle. The 
learned ensemble is called the strong classifier in the sense 
that it often achieves a much higher accuracy. One of the 
most popular boosting algorithms is AdaBoost [7], which
has been shown to minimize the regularized
exponential loss function [7], [8]. There are many variations
on AdaBoost in the literature. For example, LogitBoost [9]
optimizes the logistic regression loss instead of the exponential
loss. To understand how boosting works, Schapire et al. [10]
introduced the margin theory and suggested that boosting is
especially effective at maximizing the margins of training
examples. Based on this concept, Demiriz et al. [11] proposed
LPBoost, which maximizes the minimum margin using the
hinge loss.

Since most pattern classification problems in the real
world are multiclass problems, researchers have extended binary
boosting algorithms to the multiclass case. For example, in
[12], Freund and Schapire described two possible extensions
of AdaBoost to the multiclass case. AdaBoost.M1 is the
first and perhaps the most direct extension. In AdaBoost.M1,
a weak hypothesis assigns only one of $C$ possible labels
to each instance. Consequently, the requirement that the
training error of a weak hypothesis be less than 1/2 becomes
harder to achieve, since random guessing only has an accuracy
of $1/C$ in the multiclass case. To overcome this difficulty,
AdaBoost.M2 introduced a relaxed error measurement termed
pseudo-loss. In AdaBoost.M2, the weak hypothesis is required
to answer $C - 1$ questions for one training example
$(\mathbf{x}_i, y_i)$: which is the label of $\mathbf{x}_i$, $c$ or
$y_i$ ($\forall c \neq y_i$)? A falsely matched
pair $(\mathbf{x}_i, c)$ is called a mislabel. Pseudo-loss is defined
as the weighted average of the probabilities of all incorrect answers.
Recently, Zhu et al. [13] proposed a multiclass exponential
loss function. Boosting algorithms based on this loss, including
SAMME [13] and GAMBLE [14], only require that the weak
hypothesis perform better than random guessing ($1/C$).

The above-mentioned multiclass boosting algorithms have
a common property: the employed weak hypotheses must
be able to give predictions on all $C$ possible labels at
each call. Some powerful weak learning methods, such as
decision trees, are capable of this. However, they are complicated
and time-consuming to train compared with binary learners. A
higher complexity of the assembled classifier often implies a larger
risk of over-fitting the training data and a possible decrease in
generalization ability.

Therefore, it is natural to put forward another idea: if a 
multiclass problem can be reduced into multiple two-class 
ones, binary weak learning methods such as linear discriminant
analysis [15], [16], decision stumps or products of decision
stumps [17] might be applicable to these decomposed
subproblems. To make the reduction, one has to introduce some
appropriate coding strategy to translate each label to a fixed 
binary string, which is usually referred to as a codeword. Then 
weak hypotheses can be trained at every bit position. For a 
test example, the label is predicted by decoding the codeword 
computed from the strong classifier. AdaBoost.MO [7] is a 
representative algorithm with this coding-decoding process. To 
increase the distance between codewords and thus improve the 
error correcting ability, Dietterich and Bakiri's error-correcting 
output codes (ECOC) [18] can be used in AdaBoost.MO. A 
variant of AdaBoost.MO, AdaBoost.OC, also combines boosting
and ECOC. However, unlike AdaBoost.MO, AdaBoost.OC
employs a collection of randomly generated codewords. For
more details about the random methods, we refer the reader
to [19]. AdaBoost.OC penalizes both the wrongly classified
examples and the mislabels in correctly classified examples by
calculating pseudo-loss. Hence, AdaBoost.OC may be viewed
as a special case of AdaBoost.M2. The difference is that weak
hypotheses in AdaBoost.OC are required to answer one binary
question for each training example: which is the label for
$\mathbf{x}_i$ in the current round, 0 or 1? Later, Guruswami and
Sahai [20] proposed a variant, AdaBoost.ECC, which replaces
pseudo-loss with the common measurement to compute training errors.

In this work, we mainly focus on multiclass boosting
algorithms with binary weak learners, specifically
AdaBoost.MO and AdaBoost.ECC. It has been proven that
AdaBoost.OC is in fact a shrinkage version of AdaBoost.ECC
[21]. These two algorithms both perform stage-wise functional
gradient descent procedures on the exponential loss function
[21]. In [8], Shen and Li have shown that $\ell_1$ norm
regularized boosting algorithms, including AdaBoost, LogitBoost
and LPBoost, can be optimized through their
corresponding dual problems. This primal-dual optimization
technique is also implicitly applied by other studies to
explore the principles of boosting learning [22], [23]. Here
we study AdaBoost.MO and AdaBoost.ECC and explicitly
derive their Lagrange dual problems. Based on the primal-dual
pairs, we put these two algorithms into a column generation
based primal-dual optimization framework. We also
show analytically that the proposed boosting algorithms
are totally corrective in a relaxed fashion. Therefore, the
proposed algorithms seem to converge more effectively and
maximize the margin of training examples more aggressively.
To our knowledge, our proposed algorithms are the first totally
corrective multiclass boosting algorithms.

The notation used in this paper is as follows. We use the
symbol $\mathbf{M}$ to denote a coding matrix with $M(a, b)$ being its
$(a, b)$-th entry. Bold letters ($\mathbf{u}$, $\mathbf{v}$) denote column
vectors. $\mathbf{0}$ and $\mathbf{1}$ are vectors with all entries being
0 and 1 respectively. The inner product of two column vectors
$\mathbf{u}$ and $\mathbf{v}$ is expressed as
$\mathbf{u}^\top \mathbf{v} = \sum_i u_i v_i$. Symbols $\succcurlyeq$,
$\preccurlyeq$ placed between two vectors indicate that the inequality
relationship holds for all pairwise entries. Double-barred letters
($\mathbb{R}$, $\mathbb{Y}$) denote specific domains
or sets. The abbreviation s.t. means "subject to". $\mathbb{1}(\pi)$
denotes an indicator function which gives 1 if $\pi$ is true and 0
otherwise.

The remaining content is organized as follows. In Section II 
we briefly review the coding strategies in multiclass boosting 
learning and describe the algorithms of AdaBoost.MO and 
AdaBoost.ECC. Then in Section III we derive the primal- 
dual relations and propose our multiclass boosting learning 
framework. In Section IV we compare the related algorithms 
through several experiments. We conclude the paper in Section 
V. 

II. Multiclass boosting algorithms and coding matrix

In this section, we briefly review the multiclass boosting 
algorithms of AdaBoost.MO [7] and AdaBoost.ECC [20]. 



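Algorithm 1 AdaBoost.MO (Schapire and Singer, 1999)

Input: training data $(\mathbf{x}_i, y_i)$, $y_i \in \{1, \dots, C\}$, $i = 1, \dots, N$;
    maximum training iterations $T$, and the coding matrix $\mathbf{M}^{C \times L}$.
Initialization:
    Weight distribution $u_{i,l} = \frac{1}{NL}$, $i = 1, \dots, N$, $l = 1, \dots, L$.
for $t = 1 : T$ do
    a) Normalize $\mathbf{u}$;
    b) Train $L$ weak hypotheses $h_l^{(t)}(\cdot)$ according to the
       weight distribution $\mathbf{u}$;
    c) Compute $\epsilon = \sum_{i} \sum_{l} u_{i,l} \mathbb{1}(M(y_i, l) \neq h_l^{(t)}(\mathbf{x}_i))$;
    d) Compute $\omega^{(t)} = \frac{1}{2} \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$;
    e) Update $u_{i,l} = u_{i,l} \exp\left(-\omega^{(t)} M(y_i, l) h_l^{(t)}(\mathbf{x}_i)\right)$;
end for
Output: $f(\cdot) = \left[\sum_t \omega^{(t)} h_1^{(t)}(\cdot), \dots, \sum_t \omega^{(t)} h_L^{(t)}(\cdot)\right]^\top$.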
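The loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exhaustive-search decision stump, the function names (`train_stump`, `stump_predict`, `adaboost_mo`, `predict`), the 0-based class labels, and the clamping of the error away from 0 and 1 are all assumptions made for this example.

```python
import math

def train_stump(X, labels, weights):
    """Exhaustive weighted decision stump (hypothetical weak learner).
    Returns (feature, threshold, sign); predicts sign if x[feature] > threshold."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for thr in thresholds:
            for sign in (+1, -1):
                err = sum(w for x, lab, w in zip(X, labels, weights)
                          if (sign if x[f] > thr else -sign) != lab)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best[1:]

def stump_predict(stump, x):
    f, thr, sign = stump
    return sign if x[f] > thr else -sign

def adaboost_mo(X, y, M, T):
    """X: list of feature lists, y: 0-based labels, M: C x L matrix of +1/-1."""
    N, L = len(X), len(M[0])
    u = [[1.0 / (N * L)] * L for _ in range(N)]
    ensemble = []  # one (omega, [stump for each column]) pair per round
    for _ in range(T):
        total = sum(map(sum, u))
        u = [[v / total for v in row] for row in u]                  # step a)
        stumps = [train_stump(X, [M[yi][l] for yi in y],
                              [u[i][l] for i in range(N)])
                  for l in range(L)]                                 # step b)
        eps = sum(u[i][l] for i in range(N) for l in range(L)
                  if stump_predict(stumps[l], X[i]) != M[y[i]][l])   # step c)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)  # assumption: avoid log(0)
        omega = 0.5 * math.log((1.0 - eps) / eps)                    # step d)
        for i in range(N):                                           # step e)
            for l in range(L):
                u[i][l] *= math.exp(-omega * M[y[i]][l] * stump_predict(stumps[l], X[i]))
        ensemble.append((omega, stumps))
    return ensemble

def predict(ensemble, M, x):
    """Decode by the largest inner product between a codeword and f(x)."""
    L = len(M[0])
    f = [sum(om * stump_predict(st[l], x) for om, st in ensemble) for l in range(L)]
    scores = [sum(m * v for m, v in zip(row, f)) for row in M]
    return max(range(len(M)), key=scores.__getitem__)

# Toy usage on a hypothetical one-dimensional, three-class problem.
X_toy = [[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]]
y_toy = [0, 0, 1, 1, 2, 2]
M_toy = [[+1, -1, -1], [-1, +1, -1], [-1, -1, +1]]
model = adaboost_mo(X_toy, y_toy, M_toy, T=5)
print([predict(model, M_toy, x) for x in X_toy])
```

A single coefficient is shared by all $L$ stumps of a round, which is what distinguishes this scheme from running $L$ independent binary AdaBoost instances.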



A typical multiclass classification problem can be expressed
as follows. A training set for learning is given by
$\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. Here $\mathbf{x}_i$ is a pattern and
$y_i$ is the label, which takes a value from the space
$\mathbb{Y} = \{1, 2, \dots, C\}$ if we have $C$
classes. The goal of classification is then to find a classifier
$f : \mathcal{X} \to \mathbb{Y}$ which assigns one and only one label to a new
observation $(\mathbf{x}, y)$ with a minimal probability of $y \neq f(\mathbf{x})$.
A boosting algorithm tries to find an ensemble function of
the form $f(\mathbf{x}) = \sum_{t=1}^{T} \omega^{(t)} h^{(t)}(\mathbf{x})$ (or
equivalently the normalized version
$\sum_{t=1}^{T} \omega^{(t)} h^{(t)}(\mathbf{x}) / \sum_t \omega^{(t)}$), where
$h(\cdot)$ denotes the weak hypotheses generated by the base learning
algorithm and $\boldsymbol{\omega} = [\omega^{(1)} \cdots \omega^{(T)}]^\top$
denotes the associated coefficient vector.
Typically, a weight distribution $\mathbf{u}$ is used on training data,
which essentially makes the learning algorithm concentrate
on those examples that are hard to distinguish. The weighted
training error of hypothesis $h(\cdot)$ on $\mathbf{u}$ is given by
$\sum_i u_i \mathbb{1}(y_i \neq h(\mathbf{x}_i))$.

To decompose a multiclass problem into several binary
subproblems, a coding matrix $\mathbf{M} \in \{\pm 1\}^{C \times L}$ is required.
Let $M(c, :)$ denote the $c$-th row, which represents an $L$-length
codeword for class $c$. One binary hypothesis can then be learned
for each column, where the training examples have been
relabeled into two classes. For a newly observed instance $\mathbf{x}$,
$f(\mathbf{x})$ outputs an unknown codeword. Hamming distance or
some loss-based measure is used to calculate the distances
between this word and the rows of $\mathbf{M}$. The "closest" row is
identified as the predicted label. For binary strings, loss-based
measures are equivalent to Hamming distance.
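The decoding step can be sketched in a few lines of Python. The 4-class, 6-bit coding matrix below is a made-up example (any two rows differ in four positions), and the function names are hypothetical. For codewords in $\{\pm 1\}^L$ the inner product equals $L$ minus twice the Hamming distance, so the two decoders agree.

```python
def hamming_decode(M, codeword):
    """Return the 0-based row of M closest to `codeword` in Hamming distance."""
    distances = [sum(1 for a, b in zip(row, codeword) if a != b) for row in M]
    return min(range(len(M)), key=distances.__getitem__)

def inner_product_decode(M, codeword):
    """Return the 0-based row of M with the largest inner product with `codeword`."""
    scores = [sum(a * b for a, b in zip(row, codeword)) for row in M]
    return max(range(len(M)), key=scores.__getitem__)

# Hypothetical coding matrix: 4 classes, 6 binary subproblems.
M = [
    [+1, +1, +1, -1, -1, -1],
    [+1, -1, -1, +1, +1, -1],
    [-1, +1, -1, +1, -1, +1],
    [-1, -1, +1, -1, +1, +1],
]

# The codeword of class 2 with its last bit flipped by a weak-hypothesis error.
noisy = [-1, +1, -1, +1, -1, -1]
print(hamming_decode(M, noisy), inner_product_decode(M, noisy))
```

Because the minimum pairwise distance of this matrix is four, any single bit error is still decoded to the correct row.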

The coding process can be viewed as a mapping to a new
higher dimensional space. If the codewords in the new space
are mutually distant, a more powerful error-correction ability
can be gained. For this reason, Dietterich and Bakiri's
error-correcting output codes (ECOC) are usually chosen. In particular,
if the coding matrix equals the unit matrix (i.e., each
codeword is one basis vector in the high dimensional space),
no error can be corrected. This is the one-against-all
or one-per-class approach. Several coding matrices have been
evaluated in [24]. Another family of codes is random codes
[19], [20], [25]. Compared with fixed codes, random codes
are more flexible in exploring the relationships between different
classes.

Algorithm 2 AdaBoost.ECC (Guruswami and Sahai, 1999)

Input: training data $(\mathbf{x}_i, y_i)$, $y_i \in \{1, \dots, C\}$, $i = 1, \dots, N$;
    maximum training iterations $T$.
Initialization:
    The coding matrix $\mathbf{M} = [\,]$, and the weight distribution
    $u_{i,c} = \frac{1}{N(C-1)}$, $i = 1, \dots, N$, $c = 1, \dots, C$.
for $t = 1 : T$ do
    a) Create $M(:, t) \in \{-1, +1\}^{C \times 1}$;
    b) Normalize $\mathbf{u}$;
    c) Compute the weight distribution for mislabels
       $d_i = \sum_c u_{i,c} \mathbb{1}(M(c, t) \neq M(y_i, t))$;
    d) Normalize $\mathbf{d}$;
    e) Train a weak hypothesis $h^{(t)}(\cdot)$ using $\mathbf{d}$;
    f) Compute $\epsilon = \sum_i d_i \mathbb{1}(M(y_i, t) \neq h^{(t)}(\mathbf{x}_i))$;
    g) Compute $\omega^{(t)} = \frac{1}{4} \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$;
    h) Update $u_{i,c} = u_{i,c} \exp\left(-\omega^{(t)} (M(y_i, t) - M(c, t)) h^{(t)}(\mathbf{x}_i)\right)$;
end for
Output: $f(\cdot) = \left[\omega^{(1)} h^{(1)}(\cdot), \dots, \omega^{(T)} h^{(T)}(\cdot)\right]^\top$.
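Steps c) and d) of Algorithm 2, which turn the pairwise weights $u_{i,c}$ into a per-example distribution $\mathbf{d}$, can be sketched as follows. The function name, the 0-based labels and the error raised for a column that separates no pair of classes are assumptions of this example.

```python
def mislabel_distribution(u, y, column):
    """u[i][c]: weight on the pair (x_i, c); y[i]: 0-based true label;
    column[c]: entry M(c, t) in {-1, +1} of the newly created column."""
    d = [sum(w for c, w in enumerate(u_i) if column[c] != column[y_i])
         for u_i, y_i in zip(u, y)]
    total = sum(d)
    if total == 0:
        # Assumption: a column that gives every class the same bit is rejected.
        raise ValueError("column does not separate any mislabel")
    return [v / total for v in d]

# Two examples, three classes, and a column that isolates class 0.
u = [[0.25, 0.25, 0.25], [0.25, 0.25, 0.25]]
print(mislabel_distribution(u, [0, 1], [+1, -1, -1]))
```

Only the mislabels that the current column can distinguish from the true label contribute to $d_i$, so an example whose competing classes share its bit receives little weight in that round.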

AdaBoost.MO employs a predefined coding matrix, such
as ECOC. An $L$-dimensional weak hypothesis is trained at
each iteration, with one entry for each binary subproblem. The
pseudo-code of AdaBoost.MO is given in Algorithm 1. For a
new observation, the label can be predicted by

$$\hat{y} = \arg\max_{c = 1, \dots, C} \left\{ M(c, :) f(\mathbf{x}) \right\}
= \arg\max_{c = 1, \dots, C} \Big\{ \sum_{j=1}^{T} \omega^{(j)} M(c, :) h^{(j)}(\mathbf{x}) \Big\}. \tag{1}$$

For AdaBoost.ECC, an incremental coding matrix is involved.
At each iteration, a randomly generated code is added into the
matrix, which corresponds to a new binary subproblem being
created. AdaBoost.ECC is summarized in Algorithm 2. The
prediction is implemented by

$$\hat{y} = \arg\max_{c = 1, \dots, C} \Big\{ \sum_{j=1}^{T} \omega^{(j)} M(c, j) h^{(j)}(\mathbf{x}) \Big\}. \tag{2}$$

Comparing (2) with (1), we can see that AdaBoost.ECC is similar
to AdaBoost.MO in terms of the coding-decoding process.
Roughly speaking, the coding matrix in ECC degenerates
into a single randomly changing column; accordingly, the induced
binary problems reduce to a single one. The relationship between
these two algorithms will be completely clear after we derive
the dual problems in the next section.

III. Totally corrective multiclass boosting 

In this section we present the $\ell_1$ norm regularized optimization
problems that AdaBoost.MO and AdaBoost.ECC solve,
and derive the corresponding Lagrange dual problems. Based
on the column generation technique, we design new boosting
methods for multiclass learning. Unlike conventional boosting,
the proposed algorithms are totally corrective. For ease of
exposition, we name our algorithms MultiBoost.MO and
MultiBoost.ECC.

A. Lagrange Dual of AdaBoost.MO 

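The loss function that AdaBoost.MO optimizes is proposed
in [21]. Given a coding strategy $\mathbb{Y} \to \mathbf{M}^{C \times L}$, the loss
function can be written as

$$L_{\mathrm{MO}} = \sum_{i=1}^{N} \sum_{l=1}^{L} \exp\left(-M(y_i, l) F_l(\mathbf{x}_i)\right), \tag{3}$$

where $F_l(\mathbf{x}_i)$ is the $l$-th entry in the strong classifier and
$F_l(\mathbf{x}_i) = \sum_{t=1}^{T} \omega^{(t)} h_l^{(t)}(\mathbf{x}_i)$. For a training
example $\mathbf{x}_i$, let
$\mathbf{H}_l(\mathbf{x}_i) = [h_l^{(1)}(\mathbf{x}_i) \; h_l^{(2)}(\mathbf{x}_i) \cdots h_l^{(T)}(\mathbf{x}_i)]^\top$
denote the outputs of the $l$-th weak hypothesis at all $T$ iterations.
The boosting process of AdaBoost.MO is equivalent to solving the
following optimization problem:

$$\min_{\boldsymbol{\omega}} \; \sum_{i=1}^{N} \sum_{l=1}^{L} \exp\left(-M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}\right) \tag{4}$$

$$\text{s.t.} \;\; \boldsymbol{\omega} \succcurlyeq 0, \;\; \|\boldsymbol{\omega}\|_1 = \theta. \tag{5}$$

Notice that the constraint $\|\boldsymbol{\omega}\|_1 = \theta$ with $\theta > 0$
is not explicitly enforced in AdaBoost.MO. However, if the variable
$\boldsymbol{\omega}$ is not bounded, one can make the loss function
arbitrarily close to zero by scaling $\boldsymbol{\omega}$ by a sufficiently
large factor. For a convex and monotonically increasing function,
$\|\boldsymbol{\omega}\|_1 = \theta$ is actually equivalent to
$\|\boldsymbol{\omega}\|_1 \le \theta$ since $\boldsymbol{\omega}$ always lies on
the boundary of the feasible set.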
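The scale ambiguity can be seen numerically in a small Python sketch. It assumes a toy case in which every coded example already has a positive margin under the chosen coefficients; the margin values and the function name `exp_loss` are made up for illustration.

```python
import math

def exp_loss(margins, omega):
    """Exponential loss: sum over coded examples of exp(-margin_row . omega).
    margins[k][t] stands for M(y_i, l) * h_l^(t)(x_i) of one coded example."""
    return sum(math.exp(-sum(m * w for m, w in zip(row, omega))) for row in margins)

# Three coded examples, two weak hypotheses (hypothetical values).
margins = [[1, 1], [1, -1], [1, 1]]
omega = [2.0, 1.0]  # every example has a positive margin under this omega

for scale in (1, 10, 100):
    scaled = [scale * w for w in omega]
    print(scale, exp_loss(margins, scaled))
```

The loss shrinks towards zero as the scale grows although the decision rule is unchanged, which is why the norm of the coefficient vector is fixed to $\theta$.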

By using Lagrange multipliers, we are able to derive the 
Lagrange dual problem of any optimization problem [26]. If 
the strong duality holds, the optimal dual value is exactly the 
same as the optimal value of primal problem. 

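Theorem 1: The Lagrange dual problem of program (4) is

$$\max_{r, \mathbf{u}} \; -r\theta - \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} \log u_{i,l} + \mathbf{1}^\top \mathbf{u} \tag{6}$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \preccurlyeq r \mathbf{1}^\top, \;\; \mathbf{u} \succcurlyeq 0.$$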

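Proof: To derive this Lagrange dual, one needs to introduce
a set of auxiliary variables
$\gamma_{i,l} = -M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}$,
$i = 1, \dots, N$, $l = 1, \dots, L$. Then we can rewrite the primal
(4) into

$$\min_{\boldsymbol{\omega}, \boldsymbol{\gamma}} \; \sum_{i=1}^{N} \sum_{l=1}^{L} \exp \gamma_{i,l} \tag{7}$$

$$\text{s.t.} \;\; \gamma_{i,l} = -M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}, \;\;
\boldsymbol{\omega} \succcurlyeq 0, \;\; \|\boldsymbol{\omega}\|_1 = \theta.$$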

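The Lagrangian of the above program is

$$L(\boldsymbol{\omega}, \boldsymbol{\gamma}, \mathbf{q}, \mathbf{u}, r) =
\sum_{i=1}^{N} \sum_{l=1}^{L} \exp \gamma_{i,l} - \mathbf{q}^\top \boldsymbol{\omega}
+ r\left(\mathbf{1}^\top \boldsymbol{\omega} - \theta\right)
- \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} \left(\gamma_{i,l} + M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}\right) \tag{8}$$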
i=l 1 = 1 
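with $\mathbf{q} \succcurlyeq 0$. The Lagrange dual function is defined as the
infimum value of the Lagrangian over variables $\boldsymbol{\omega}$ and
$\boldsymbol{\gamma}$.

$$\inf_{\boldsymbol{\omega}, \boldsymbol{\gamma}} L =
\inf_{\boldsymbol{\omega}, \boldsymbol{\gamma}} \;
\sum_{i=1}^{N} \sum_{l=1}^{L} \exp \gamma_{i,l}
- \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} \gamma_{i,l} - r\theta
+ \underbrace{\Big( r \mathbf{1}^\top - \mathbf{q}^\top
- \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \Big)}_{\text{must be } 0}
\boldsymbol{\omega} \tag{9}$$

$$= \underbrace{\inf_{\boldsymbol{\gamma}} \; \sum_{i=1}^{N} \sum_{l=1}^{L} \exp \gamma_{i,l}
- \sum_{i=1}^{N} \sum_{l=1}^{L} u_{i,l} \gamma_{i,l}}_{\text{conjugate of exp. function}} - \; r\theta$$

$$= -\sum_{i=1}^{N} \sum_{l=1}^{L} \left( u_{i,l} \log u_{i,l} - u_{i,l} \right) - r\theta. \tag{10}$$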



The convex conjugate (or Fenchel conjugate) of a function
$f : \mathbb{R} \to \mathbb{R}$ is defined as

$$g(y) = \sup_{x \in \operatorname{dom} f} \left( y x - f(x) \right). \tag{11}$$

Here we have used the fact that the conjugate of the exponential
function $f(\gamma) = e^{\gamma}$ is $g(u) = u \log u - u$, if and only if
$u \ge 0$. $0 \log 0$ is interpreted as 0.
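The conjugate pair used here can be checked directly from definition (11). The short derivation below is a standard calculus step added for completeness and is not part of the original proof.

```latex
\begin{align*}
g(u) &= \sup_{\gamma \in \mathbb{R}} \left( u\gamma - e^{\gamma} \right), \\
\frac{\partial}{\partial \gamma} \left( u\gamma - e^{\gamma} \right) &= u - e^{\gamma} = 0
  \;\Longrightarrow\; \gamma^{\star} = \log u \quad (u > 0), \\
g(u) &= u \log u - e^{\log u} = u \log u - u .
\end{align*}
```

For $u = 0$ the supremum is approached as $\gamma \to -\infty$ and equals 0, which matches the convention $0 \log 0 = 0$. For $u < 0$ the expression is unbounded above, so the dual variables are restricted to $\mathbf{u} \succcurlyeq 0$.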

For each pair $(\mathbf{u}, r)$, the dual function gives a lower bound
on the optimal value of the primal problem (4). Through
maximizing the dual function, the best bound can be obtained.
This is exactly the dual problem we derived. After eliminating
$\mathbf{q}$ and collecting all the constraints, we complete the proof. ■

Since the primal (4) is a convex problem and satisfies
Slater's condition [26], strong duality holds, which means that
maximizing the cost function in (6) over the dual variables $\mathbf{u}$ and
$r$ is equivalent to solving problem (4). Then if a dual optimal
solution $(\mathbf{u}^*, r^*)$ is known, any primal optimal point is also a
minimizer of $L(\boldsymbol{\omega}, \boldsymbol{\gamma}, \mathbf{q}, \mathbf{u}^*, r^*)$.
In other words, if the primal optimal solution $\boldsymbol{\omega}^*$ exists,
it can be obtained by minimizing the following function of
$\boldsymbol{\omega}$ (see (9)):

$$-r^*\theta + \sum_{i=1}^{N} \sum_{l=1}^{L} \Big( \exp\left(-M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}\right)
+ u_{i,l}^* M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega} \Big). \tag{12}$$

We can also use KKT conditions to establish the relationship
between the primal variables and dual variables [26].

B. MultiBoost.MO: Totally Corrective Boosting based on Column Generation

Clearly, we cannot obtain the optimal solution of (6) until
all the weak hypotheses in the constraints become available. In
order to solve this optimization problem, we use an optimization
technique termed column generation [11]. The concept of
column generation is to add one constraint at a time to the dual
problem until an optimal solution is identified. In our case, at
each iteration we find the weak classifier that most violates the
constraint in the dual. For (6), such a multidimensional weak
classifier $h^*(\cdot) = [h_1^*(\cdot) \cdots h_L^*(\cdot)]^\top$ can be found by
solving the following problem:

$$h^*(\cdot) = \arg\max_{h(\cdot)} \sum_{l=1}^{L} \sum_{i=1}^{N} u_{i,l} M(y_i, l) h_l(\mathbf{x}_i), \tag{13}$$

which is equivalent to solving $L$ subproblems:

$$h_l^*(\cdot) = \arg\max_{h_l(\cdot)} \sum_{i=1}^{N} u_{i,l} M(y_i, l) h_l(\mathbf{x}_i), \quad l = 1, \dots, L. \tag{14}$$

If we view $u_{i,l}$ as the weight of the coded training example
$(\mathbf{x}_i, M(y_i, l))$, $i = 1, \dots, N$, $l = 1, \dots, L$, this is exactly
the same as the strategy AdaBoost.MO uses, that is, to
find $L$ weak classifiers that produce the smallest weighted
training error (maximum algebraic sum of weights of correctly
classified data) with respect to the current weight distribution
$\mathbf{u}$.

When a new constraint is added into the dual program,
the optimal value of the maximization problem (6)
decreases. Accordingly, the primal optimal value decreases too
because of the zero duality gap. The primal problem is convex,
which assures that our optimization procedure converges to
the global extremum. In practice, MultiBoost.MO converges
quickly on our tested datasets.

Next we need to find the connection between the primal
variables $\boldsymbol{\omega}$ and the dual variables $\mathbf{u}$ and $r$.
According to the KKT conditions, since the primal optimal
$\boldsymbol{\omega}^*$ minimizes (12) over $\boldsymbol{\omega}$, its gradient
must equal 0 at $\boldsymbol{\omega}^*$. Thus we have

$$u_{i,l}^* = \exp\left(-M(y_i, l) \mathbf{H}_l^\top(\mathbf{x}_i) \boldsymbol{\omega}^*\right). \tag{15}$$

In our experiments, we have used the MOSEK [27] optimization
software, which is a primal-dual interior-point solver. Both the
primal and dual solutions are available at convergence.

C. Lagrange Dual of AdaBoost.ECC and MultiBoost.ECC 

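The primal-dual method is so general that arbitrary
boosting algorithms based on convex loss functions can be
integrated into this framework. Next we investigate another
multiclass boosting algorithm: AdaBoost.ECC.

Let us denote
$\rho_{i,c}[h^{(t)}] = (M(y_i, t) - M(c, t)) h^{(t)}(\mathbf{x}_i)$. If
we define the margin of example $(\mathbf{x}_i, y_i)$ on hypothesis
$h^{(t)}(\cdot)$ as

$$\rho_i[h^{(t)}] \triangleq \min_{c \neq y_i} \left\{ \rho_{i,c}[h^{(t)}] \right\}
= M(y_i, t) h^{(t)}(\mathbf{x}_i) - \max_{c \neq y_i} \left\{ M(c, t) h^{(t)}(\mathbf{x}_i) \right\}, \tag{16}$$

the margin on the assembled classifier $\varsigma(\cdot)$ can be computed as

$$\rho_i[\varsigma] = \min_{c \neq y_i} \left\{ \rho_{i,c}[\varsigma] \right\}
= M(y_i, :) \varsigma(\mathbf{x}_i) - \max_{c \neq y_i} \left\{ M(c, :) \varsigma(\mathbf{x}_i) \right\}
= \frac{1}{\sum_t \omega^{(t)}} \min_{c \neq y_i} \Big\{ \sum_{t=1}^{T} \omega^{(t)} \rho_{i,c}[h^{(t)}] \Big\}. \tag{17}$$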
AdaBoost.ECC tries to maximize the minimum margin by
optimizing the following loss function [21]:

$$L_{\mathrm{ECC}} = \sum_{i=1}^{N} \sum_{c=1}^{C} \exp\left(-(M(y_i, :) - M(c, :)) f(\mathbf{x}_i)\right)
= \sum_{i=1}^{N} \sum_{c=1}^{C} \exp\Big(-\sum_{t=1}^{T} \omega^{(t)} \rho_{i,c}[h^{(t)}]\Big). \tag{18}$$

Denote $\mathbf{P}_{i,c} = [\rho_{i,c}[h^{(1)}] \; \rho_{i,c}[h^{(2)}] \cdots \rho_{i,c}[h^{(T)}]]^\top$.
Obviously, $\mathbf{P}_{i,y_i} = \mathbf{0}^\top$ for any example $\mathbf{x}_i$.
Therefore the problem we are interested in can be equivalently written as:

$$\min_{\boldsymbol{\omega}} \; \sum_{i=1}^{N} \sum_{c=1}^{C} \exp\left(-\mathbf{P}_{i,c}^\top \boldsymbol{\omega}\right) \tag{19}$$

$$\text{s.t.} \;\; \boldsymbol{\omega} \succcurlyeq 0, \;\; \|\boldsymbol{\omega}\|_1 = \theta.$$

Like AdaBoost.MO, we have added an $\ell_1$ norm constraint
to remove the scale ambiguity. Clearly, this is also a convex
problem in $\boldsymbol{\omega}$ and strong duality holds.

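Theorem 2: The Lagrange dual problem of (19) is

$$\max_{r, \mathbf{u}} \; -r\theta - \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \log u_{i,c}
+ \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \tag{20}$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \mathbf{P}_{i,c}^\top \preccurlyeq r \mathbf{1}^\top, \;\;
\mathbf{u} \succcurlyeq 0.$$

The derivation is very similar to that in the proof of
Theorem 1. Notice that the first constraint has no effect on
the variables $u_{i,y_i}$ since $\mathbf{P}_{i,y_i} = \mathbf{0}^\top$,
$\forall i = 1, \dots, N$; however,
the problem is still bounded because $-u \log u + u \le 1$ for
all $u \ge 0$. Actually, we have $u_{i,y_i} = 1$ in the process of
optimization.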
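The bound quoted in the remark above follows from a one-variable maximization, written out here as a short check rather than as part of the original derivation.

```latex
\begin{align*}
\phi(u) &= -u \log u + u, \qquad \phi'(u) = -\log u, \qquad \phi''(u) = -\frac{1}{u} < 0, \\
\phi'(u) &= 0 \;\Longrightarrow\; u = 1, \\
\max_{u > 0} \phi(u) &= \phi(1) = 1 .
\end{align*}
```

Since the variables $u_{i,y_i}$ appear only through this term, each of them takes the maximizer $u = 1$.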

To solve this dual problem, we also employ the idea of
column generation. Hence at the $t$-th iteration, such an optimal
weak classifier can be found by

$$h^*(\cdot) = \arg\max_{h(\cdot)} \sum_{i=1}^{N} \Big( \sum_{c} u_{i,c} (M(y_i, t) - M(c, t)) \Big) h(\mathbf{x}_i). \tag{21}$$

Notice that $M(y_i, t) - M(c, t) = 2 M(y_i, t) \mathbb{1}(M(y_i, t) \neq M(c, t))$,
so if we rewrite (21) in the following form:

$$h^*(\cdot) = \arg\max_{h(\cdot)} \sum_{i=1}^{N} \Big( \Big( \sum_{c} u_{i,c} \mathbb{1}(M(y_i, t) \neq M(c, t)) \Big)
M(y_i, t) h(\mathbf{x}_i) \Big), \tag{22}$$

it is straightforward to show that we can use the same strategy
as AdaBoost.ECC to obtain weak classifiers. To be more
precise, the strategy is to minimize the training error with
respect to the mislabel distribution.
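The identity used to pass from (21) to (22) can be verified by checking the two possible cases for entries in $\{-1, +1\}$; the case split below is added only as a worked step.

```latex
\[
M(y_i,t) - M(c,t) =
\begin{cases}
0, & M(c,t) = M(y_i,t), \\
2\,M(y_i,t), & M(c,t) = -M(y_i,t),
\end{cases}
\quad\Longrightarrow\quad
M(y_i,t) - M(c,t) = 2\,M(y_i,t)\,\mathbb{1}\!\left( M(y_i,t) \neq M(c,t) \right).
\]
```

The constant factor 2 does not change the maximizer, so it can be dropped from the objective.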

Optimization problems (6) and (20) are quite
similar. If we denote $\rho_{i,l}[h^{(t)}] = M(y_i, l) h_l^{(t)}(\mathbf{x}_i)$ in the
first case, the margin of example $(\mathbf{x}_i, y_i)$ on hypothesis $h^{(t)}$
would be $\rho_i[h^{(t)}] = \min_l \{ \rho_{i,l}[h^{(t)}] \}$, and also

$$\rho_i[\varsigma] = \min_{l} \left\{ M(y_i, l) \varsigma_l(\mathbf{x}_i) \right\}
= \frac{1}{\|\boldsymbol{\omega}\|_1} \min_{l} \Big\{ \sum_{t} \omega^{(t)} \rho_{i,l}[h^{(t)}] \Big\}. \tag{23}$$

Based on different definitions of margin, these two problems
share exactly the same expression. To summarize, we combine
the algorithms that we proposed in this section and give a
general framework for multiclass boosting in Algorithm 3.

Algorithm 3 Totally Corrective Multiclass Boosting

Input: training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, N$; termination
    threshold $\epsilon > 0$; regularization parameter $\theta > 0$;
    maximum training iterations $T$.
(1) Initialization:
    $t = 0$; $\boldsymbol{\omega} = 0$; $r = 0$;
    $u_{i,k} = \frac{1}{NK}$, $i = 1, \dots, N$, $k = 1, \dots, K$.
while true do
    (2) Find a new weak classifier $h^*(\cdot)$ by solving the
        subproblem in column generation:
        $h^* = \arg\max_{h} \sum_i \sum_k u_{i,k} \rho_{i,k}[h]$;
    (3) Check if the dual problem is bounded by the new constraint:
        if $\sum_i \sum_k u_{i,k} \rho_{i,k}[h^*] < r + \epsilon$, then break;
    (4) Add the new constraint to the dual problem;
    (5) Solve the dual problem to obtain updated $r$ and $\mathbf{u}$:
        $\max_{r, \mathbf{u}} \; -r\theta - \sum_i \sum_k u_{i,k} \log u_{i,k} + \mathbf{1}^\top \mathbf{u}$
        s.t. $\sum_i \sum_k u_{i,k} \rho_{i,k}[h^{(j)}] \le r$, $j = 1, \dots, t$;
    (6) $t = t + 1$; if $t > T$, then break;
end while
(7) Calculate the primal variable $\boldsymbol{\omega}$ according to the dual
    solutions and the KKT condition.
Output: $f(\cdot) = \sum_{j=1}^{t} \omega^{(j)} h^{(j)}(\cdot)$.
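The column generation loop of Algorithm 3 can be sketched in Python under several simplifying assumptions, all of which are choices of this example rather than of the paper. The candidate weak hypotheses are given as a finite `pool` of margin vectors, with `pool[j][k]` playing the role of $\rho_{i,k}[h_j]$ for a flattened pair index. The restricted problem is solved approximately in the primal by exponentiated-gradient steps instead of an interior-point solver such as MOSEK. The dual weights are then recovered from the exponential relation (15), and $r$ is taken as the largest constraint value among the selected columns.

```python
import math

def solve_restricted_primal(cols, theta, steps=2000, eta=0.05):
    """Approximately minimise sum_k exp(-sum_j w[j] * cols[j][k])
    subject to w >= 0 and sum(w) == theta (exponentiated-gradient steps)."""
    t, n = len(cols), len(cols[0])
    w = [theta / t] * t
    for _ in range(steps):
        e = [math.exp(-sum(w[j] * cols[j][k] for j in range(t))) for k in range(n)]
        grad = [-sum(cols[j][k] * e[k] for k in range(n)) for j in range(t)]
        shift = min(grad)  # a common shift cancels after renormalisation
        w = [w[j] * math.exp(-eta * (grad[j] - shift)) for j in range(t)]
        s = sum(w)
        w = [theta * v / s for v in w]
    return w

def totally_corrective(pool, theta, eps=1e-6, max_iter=50):
    n = len(pool[0])
    u = [1.0 / n] * n      # step (1): uniform weights over flattened pairs
    r = 0.0
    selected, w = [], []
    for _ in range(max_iter):
        # step (2): pick the most violated dual constraint
        scores = [sum(u[k] * col[k] for k in range(n)) for col in pool]
        j = max(range(len(pool)), key=scores.__getitem__)
        if scores[j] < r + eps:          # step (3): stopping test
            break
        selected.append(j)               # step (4): add the new constraint
        cols = [pool[i] for i in selected]
        w = solve_restricted_primal(cols, theta)   # stands in for step (5)
        u = [math.exp(-sum(w[i] * cols[i][k] for i in range(len(cols))))
             for k in range(n)]          # dual weights via the relation (15)
        r = max(sum(u[k] * c[k] for k in range(n)) for c in cols)
    return selected, w, u, r

# Hypothetical pool of four candidate hypotheses over four coded examples.
pool = [[1, 1, -1, 1], [1, -1, 1, 1], [-1, 1, 1, 1], [-1, -1, -1, -1]]
selected, w, u, r = totally_corrective(pool, theta=2.0)
print(selected, [round(v, 3) for v in w])
```

Unlike a stage-wise update, all coefficients of the selected hypotheses are re-optimized every time a column is added, which is the sense in which the procedure is totally corrective.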

D. Hinge Loss Based Multiclass Boosting 

Within this framework, we can devise other boosting
algorithms based on different loss functions. Here we give an
example. According to (1), if a pattern $\mathbf{x}_i$ is well classified, it
should satisfy

$$M(y_i, :) f(\mathbf{x}) \ge M(c, :) f(\mathbf{x}) + 1, \quad \forall c \neq y_i \tag{24}$$

with ip > 0. Define the hinge loss for 

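$$\xi_i = \max_{c} \left\{ M(c, :) f(\mathbf{x}) + 1 - \delta_{c, y_i} \right\} - M(y_i, :) f(\mathbf{x}), \tag{25}$$

where $\delta_{c, y_i} = 1$ if $c = y_i$, else 0. That is to say, if $\mathbf{x}_i$ is
fully separable, then $\xi_i = 0$; else it suffers a loss $\xi_i > 0$. So
the problem we are interested in is to find a classifier
$f(\cdot) = [h^{(1)}(\cdot) \cdots h^{(T)}(\cdot)]^\top \boldsymbol{\omega}$ through the
following optimization:

$$\min_{\boldsymbol{\omega}, \boldsymbol{\xi}} \; \sum_{i=1}^{N} \xi_i \tag{26}$$

$$\text{s.t.} \;\; M(c, :) f(\mathbf{x}) + 1 - \delta_{c, y_i} - M(y_i, :) f(\mathbf{x}) \le \xi_i, \; \forall i, c; \;\;
\boldsymbol{\omega} \succcurlyeq 0; \;\; \|\boldsymbol{\omega}\|_1 = \theta.$$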
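Theorem 3: The equivalent dual problem of (26) is

$$\min_{r, \mathbf{u}} \; r\theta + \sum_{i=1}^{N} \sum_{c=1}^{C} \delta_{c, y_i} u_{i,c} \tag{27}$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \left( M(y_i, :) - M(c, :) \right) h^{(j)}(\mathbf{x}_i) \le r, \; \forall j;$$

$$\mathbf{u} \succcurlyeq 0, \;\; \sum_{c=1}^{C} u_{i,c} = 1, \; \forall i.$$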

Proof: The Lagrangian of this program is a linear function
of both $\boldsymbol{\xi}$ and $\boldsymbol{\omega}$; therefore, the proof is easily
done by setting the partial derivatives with respect to them to zero, and
substituting the results back. ■
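The steps of this proof can be written out as follows. This is a sketch added for readability: the multipliers $\mathbf{q} \succcurlyeq 0$ for the constraint $\boldsymbol{\omega} \succcurlyeq 0$ and the shorthand $f(\mathbf{x}_i) = [h^{(1)}(\mathbf{x}_i) \cdots h^{(T)}(\mathbf{x}_i)]\,\boldsymbol{\omega}$ are notation assumed here.

```latex
\begin{align*}
L &= \sum_{i} \xi_i
   + \sum_{i,c} u_{i,c} \Big( \big(M(c,:) - M(y_i,:)\big) f(\mathbf{x}_i) + 1 - \delta_{c,y_i} - \xi_i \Big)
   - \mathbf{q}^\top \boldsymbol{\omega} + r \big( \mathbf{1}^\top \boldsymbol{\omega} - \theta \big),
   \qquad \mathbf{u} \succcurlyeq 0, \; \mathbf{q} \succcurlyeq 0, \\
\frac{\partial L}{\partial \xi_i} &= 1 - \sum_{c} u_{i,c} = 0
   \;\Longrightarrow\; \sum_{c} u_{i,c} = 1, \\
\frac{\partial L}{\partial \omega^{(j)}} &= \sum_{i,c} u_{i,c} \big(M(c,:) - M(y_i,:)\big) h^{(j)}(\mathbf{x}_i) - q_j + r = 0
   \;\Longrightarrow\; \sum_{i,c} u_{i,c} \big(M(y_i,:) - M(c,:)\big) h^{(j)}(\mathbf{x}_i) = r - q_j \le r, \\
\inf_{\boldsymbol{\omega}, \boldsymbol{\xi}} L &= \sum_{i,c} u_{i,c} \big( 1 - \delta_{c,y_i} \big) - r\theta
   = N - \Big( r\theta + \sum_{i,c} \delta_{c,y_i} u_{i,c} \Big).
\end{align*}
```

Maximizing the last expression is the same as minimizing $r\theta + \sum_{i,c} \delta_{c,y_i} u_{i,c}$, since the constant $N$ does not affect the optimizer.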
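Suppose the length of a codeword $M(c, :)$ is $L$. Using the
idea of column generation, we can iteratively obtain weak
hypotheses and the associated coefficients by solving

$$h^*(\cdot) = \arg\max_{h(\cdot)} \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \sum_{t=1}^{L} \left( M(y_i, t) - M(c, t) \right) h_t(\mathbf{x}_i), \tag{28}$$

or $L$ subproblems

$$h_t^*(\cdot) = \arg\max_{h_t(\cdot)} \sum_{i=1}^{N} \sum_{c=1}^{C} u_{i,c} \left( M(y_i, t) - M(c, t) \right) h_t(\mathbf{x}_i), \quad t = 1, \dots, L. \tag{29}$$

This is exactly the same as (21). In other words, we can follow
the same procedures as in AdaBoost.ECC to obtain each entry
of weak hypotheses.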

The proposed boosting framework may inspire us to design
other multiclass boosting algorithms in the primal by considering
different coding strategies and loss functions. For example, we can
use a predefined coding matrix to train a set of multidimensional
hypotheses as in AdaBoost.MO and, at the same time, penalize
the mismatched labels as in AdaBoost.ECC. The result would be a
mixture of these two algorithms. However, this is beyond the
scope of this paper.

E. Totally Corrective Update 

Next, we provide an alternative explanation for the new
boosting methods. In Algorithm 3, suppose $t$ weak hypotheses
have been found while the cost function has not yet converged
(i.e., the $t$-th constraint is not satisfied). To obtain the updated
weight distribution $\mathbf{u}$ for the next iteration, we need to solve
the following optimization problem:

$$\min_{r, \mathbf{u}} \; r\theta + \sum_{i=1}^{N} \sum_{k=1}^{K} u_{i,k} \log u_{i,k} \tag{30}$$

$$\text{s.t.} \;\; \mathbf{u}^\top \boldsymbol{\rho}[h^{(j)}] \le r, \; j = 1, \dots, t, \tag{31}$$

$$\mathbf{u} \succcurlyeq 0, \;\; \mathbf{1}^\top \mathbf{u} = 1,$$

where $\mathbf{u}^\top \boldsymbol{\rho}[h^{(j)}] = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{i,k} \rho_{i,k}[h^{(j)}]$.
In other words, $\mathbf{u}^{(t+1)\top} \boldsymbol{\rho}[h^{(j)}] \le r$ holds for
$j = 1, \dots, t$.

In [22], Kivinen and Warmuth have shown that the corrective
update of the weight distribution in standard boosting
algorithms can be viewed as a solution to the divergence
minimization problem with the constraint
$\mathbf{u}^{(t+1)\top} \boldsymbol{\rho}[h^{(t)}] = 0$.




[Figure 1: two panels, (a) and (b); horizontal axis: iterations.]

Fig. 1. Inner products between the new distribution and the past mistake
vectors $F(t) = \{\mathbf{u}^{(t+1)\top} \boldsymbol{\rho}[h^{(j)}]\}_{j=1}^{t}$ on the
training data of wine. In (a) AdaBoost.MO, $F(t)$ starts with $F(j) = 0$
(corrective update) and quickly becomes uncontrollable; in (b)
MultiBoost.MO, $F(t)$ is consistently bounded by $r$.



This means that the new distribution $\mathbf{u}^{(t+1)}$ should be closest
to $\mathbf{u}^{(t)}$ but uncorrelated with the mistakes made by the
current weak hypothesis. If $\mathbf{u}^{(t+1)}$ is further required to be
orthogonal to all the past $t$ mistake vectors:

$$\mathbf{u}^{(t+1)\top} \boldsymbol{\rho}[h^{(j)}] = 0, \; \forall j = 1, \dots, t, \tag{32}$$

the update technique is called a totally corrective update. At
the same time, each of the previous $t$ hypotheses receives an
increment in its coefficient.

Notice that there is not always an exact solution to (32).
Even if a solution exists, this optimization problem might
become too complex as $t$ increases. Some attempts have
been made to obtain an approximate result. Kivinen and
Warmuth [22] suggested using an iterative approach, which
is actually a column generation based optimization procedure.
Oza [28] proposed to construct $\mathbf{u}^{(t+1)}$ by averaging
$t + 1$ distributions computed from the standard AdaBoost update,
which is in fact the least-squares solution to the linear equations
$[\boldsymbol{\rho}[h^{(1)}] \cdots \boldsymbol{\rho}[h^{(t)}]]^\top \mathbf{u} = 0$.
The stopping criterion of his AveBoost implied a continuous descent
of inner products as in our algorithm, but in a heuristic
way. Jiang and Ding [29] also noticed the feasibility problem
of (32). They proposed to solve a subset of, say, $m$ equations
instead of the entire system; however, they did not indicate how to
choose $m$.

In our algorithms, all the coefficients $\{\omega^{(j)}\}_{j=1}^{t}$ associated
with weak hypotheses are updated at each iteration, while the
inner products of the distribution $\mathbf{u}^{(t+1)}$ and
$\boldsymbol{\rho}[h^{(j)}]$ are consistently bounded by $r$. In this sense,
our multiclass boosting learning can be considered as a relaxed
version of a totally corrective algorithm with slack variable $r$.
Figure 1 illustrates the difference between the corrective algorithm
and our algorithm. Intuitively, it is more feasible to solve the
inequalities (31) than the same number of equations (32), although most
(but not all, see Fig. 1) equalities in (31) are satisfied during
the optimization process. Analogous relaxation methods have
appeared in LPBoost [11] and TotalBoost [30]. In contrast, our
algorithm simultaneously restrains the distribution divergence
and the correlations of hypotheses. The parameter $\theta$ is a
trade-off that controls the balance. To our knowledge, this is the
first algorithm to introduce the totally corrective concept to
multiclass boosting learning.

If we remove the constraints on $\boldsymbol{\omega}$ in the primal, for example,
constraints (5), the Lagrange dual problem turns out to be a totally
corrective minimizer of negative entropy:

$$\min_{r, \mathbf{u}} \; \sum_{i=1}^{N} \sum_{k=1}^{K} u_{i,k} \log u_{i,k} \tag{33}$$

$$\text{s.t.} \;\; \mathbf{u}^\top \boldsymbol{\rho}[h^{(j)}] = 0, \; j = 1, \dots, t, \;\;
\mathbf{u} \succcurlyeq 0, \;\; \mathbf{1}^\top \mathbf{u} = 1,$$

which is similar to the explanation of the corrective update in [22],
where the distribution divergence is measured by relative
entropy (Kullback-Leibler divergence). However, the constraints
on $\boldsymbol{\omega}$ are quite important as we discussed before, and should
not be simply removed. Therefore, it seems reasonable to
use a relaxed version, instead of the standard totally corrective
constraints, for the distribution update of boosting learning.

It has been proven that totally corrective boosting reduces
the upper bound on training error more aggressively than
standard corrective boosting (including AdaBoost.MO
and AdaBoost.ECC) [29], [31], which performs a slow stage-wise
gradient descent procedure on the loss function. Thus,
our boosting algorithms can be expected to converge faster
than their counterparts. The fast convergence speed
is advantageous in reducing the training cost and producing a
classifier composed of fewer weak hypotheses [8], [31]. Further,
a simpler strong classifier leads to a speedup of
classification, which is critical to many applications, especially
those with real-time requirements.

IV. Experiments 

In this section, we perform several experiments to compare
our totally corrective multiclass boosting algorithms with
previous work, including MultiBoost.MO against its stage-wise
counterpart AdaBoost.MO, and MultiBoost.ECC against
its stage-wise counterpart AdaBoost.ECC. For MultiBoost.MO
and AdaBoost.MO, we design error-correcting output codes
(ECOC) [18] as the coding matrix. For our new algorithms,
we solve the dual optimization problems using the off-the-shelf
MOSEK package [27].

TABLE I
Multiclass UCI datasets

dataset      #train   #test   #attribute   #class
svmguide2    391              20           3
svmguide4    300      312     10           6
wine         178              13           3
iris         150              4            3
glass        214              9            6
thyroid      3772     3428    21           3
dna          2000     1186    180          3
vehicle      846              18           4

The datasets used in our experiments are taken from the
UCI repository [32] and summarized in Table I. We preprocess
them as follows: if a dataset comes with a pre-specified test
set, that partition is retained; otherwise, 70% of the samples
are used for training and the remaining 30% for testing. In
each run, the two sets are merged and rebuilt by randomly
selecting examples. To keep the multiclass problems balanced,
the examples of each class are split in the same proportion.
The boosting algorithms are then run on the new sets. This
procedure is repeated 20 times, and we report the average as
the experimental result.
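The resampling protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `stratified_split` and `repeated_splits` are hypothetical, and the rounding rule for the per-class cut is an assumption. For a dataset with a pre-specified test set, `train_fraction` would be set to the original ratio of training examples instead of 0.7.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, rng=None):
    """Return (train_idx, test_idx) with every class split in proportion."""
    if rng is None:
        rng = random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Assumed rounding rule: nearest integer, but keep at least one
        # example on each side when the class has two or more members.
        cut = round(train_fraction * len(members))
        if len(members) > 1:
            cut = min(max(cut, 1), len(members) - 1)
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


def repeated_splits(labels, repeats=20, train_fraction=0.7, seed=0):
    """Draw `repeats` independent stratified splits from one seeded generator."""
    rng = random.Random(seed)
    return [stratified_split(labels, train_fraction, rng) for _ in range(repeats)]
```

Each of the 20 splits would then be used to train and evaluate one booster, and the reported mean and standard deviation would be taken over the 20 resulting error values.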

In the first experiment, we choose decision stumps as the
weak learners. The decision stump is a binary classifier that
is used extensively because of its simplicity. The parameters
are preset as follows. The maximum number of training
iterations is set to 50, 100 and 500. An important parameter
to tune is the regularization parameter $\theta$, which equals
the $\ell_1$ norm of the coefficient vector associated with the
weak hypotheses. A simple way to choose $\theta$ is to run the
corresponding stage-wise algorithm on the same data and then
compute the algebraic sum of its coefficients:
$\theta = \sum_j w_j$. We use this method for the datasets
svmguide2, svmguide4, wine, iris and glass, which contain a
small number of examples. The same strategy has been used in
[8] to test binary totally corrective boosting.

For the others, we choose $\theta$ from {2, 5, 8, 10, 12, 15,
20, 30, 40, 45, 60, 80, 100, 120, 150, 200} by running a
five-fold cross validation on training data. In particular,
we use a pseudo-random code generator in the cross validations
of AdaBoost.ECC and MultiBoost.ECC, to make sure each
candidate parameter is tested under the same coding strategy.
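The grid search described above can be sketched as follows. This is only an illustration under stated assumptions: `fit_and_error` is a hypothetical callback standing in for training a booster and measuring its validation error, and passing a per-fold `code_seed` is one assumed way of keeping the pseudo-random coding identical across candidate values.

```python
import random

# Candidate values listed in the text.
THETA_GRID = [2, 5, 8, 10, 12, 15, 20, 30, 40, 45, 60, 80, 100, 120, 150, 200]


def k_fold_indices(n_examples, k=5, seed=0):
    """Shuffle 0..n_examples-1 once and deal the indices into k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def select_theta(n_examples, fit_and_error, grid=THETA_GRID, k=5, seed=0):
    """Return (best_theta, mean_validation_error) over the grid.

    fit_and_error(train_idx, val_idx, theta, code_seed) is a hypothetical
    callback: it should train a booster with regularization theta and
    return its validation error. code_seed is meant to seed the random
    coding generator of the ECC variants.
    """
    folds = k_fold_indices(n_examples, k, seed)
    best_theta, best_error = None, float("inf")
    for theta in grid:
        fold_errors = []
        for held_out in range(k):
            val_idx = folds[held_out]
            train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
            # The seed depends only on the fold, so every candidate theta
            # sees the same folds and the same pseudo-random codes.
            code_seed = seed + held_out
            fold_errors.append(fit_and_error(train_idx, val_idx, theta, code_seed))
        mean_error = sum(fold_errors) / k
        if mean_error < best_error:
            best_theta, best_error = theta, mean_error
    return best_theta, best_error
```

Ties are resolved in favor of the smaller candidate, which is a design choice of this sketch rather than something stated in the text.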

The experimental results are reported in Table II. Almost
all the training errors of the totally corrective algorithms
are lower than those of their counterparts, except where both
algorithms have converged to 0. Fig. 2 shows the training
error curves on some datasets when the number of training
iterations is set to 500. The totally corrective algorithms
clearly converge much faster than the stage-wise ones, which
is consistent with the discussion in Section III-E. On
svmguide2, iris and glass in particular, the new algorithms
are around 50 iterations faster than their counterparts.

TABLE II
Training and test errors (mean ± standard deviation) of AdaBoost.MO (AB.MO), MultiBoost.MO (TC.MO), AdaBoost.ECC (AB.ECC) and
MultiBoost.ECC (TC.ECC) on UCI datasets. Average results of 20 repeated tests are reported. Base learners are decision stumps.
Results in bold are better than their counterparts.

dataset    algorithm train error 50   train error 100  train error 500  test error 50    test error 100   test error 500
svmguide2  AB.MO     0.016±0.005      0.000±0.000      0±0              0.225±0.031      0.212±0.030      0.229±0.024
           TC.MO     0.011±0.003      0±0              0±0              0.224±0.024      0.221±0.038      0.233±0.028
           AB.ECC    0.049±0.009      0.003±0.003      0±0              0.253±0.032      0.226±0.031      0.226±0.030
           TC.ECC    0.032±0.009      0.001±0.002      0±0              0.242±0.031      0.223±0.034      0.221±0.028
svmguide4  AB.MO     0.041±0.004      0.016±0.002      0±0              0.204±0.025      0.194±0.017      0.190±0.017
           TC.MO     0.039±0.004      0.014±0.002      0±0              0.203±0.023      0.196±0.016      0.193±0.018
           AB.ECC    0.180±0.021      0.115±0.023      0±0              0.285±0.023      0.262±0.025      0.233±0.023
           TC.ECC    0.158±0.023      0.087±0.020      0±0              0.275±0.022      0.245±0.028      0.237±0.022
wine       AB.MO     0±0              0±0              0±0              0.032±0.015      0.030±0.029      0.031±0.028
           TC.MO     0±0              0±0              0±0              0.032±0.016      0.031±0.023      0.032±0.026
           AB.ECC    0±0              0±0              0±0              0.026±0.020      0.032±0.015      0.026±0.024
           TC.ECC    0±0              0±0              0±0              [illegible]      0.0?2±0.034      [illegible]
iris       AB.MO     0.000±0.001      0±0              0±0              0.062±0.026      0.064±0.025      0.060±0.032
           TC.MO     0±0              0±0              0±0              0.062±0.026      0.067±0.023      0.057±0.025
           AB.ECC    0±0              0±0              0±0              0.057±0.024      0.062±0.026      0.053±0.032
           TC.ECC    0±0              0±0              0±0              0.0?1±0.0?1      0.0??±0.0?0      [illegible]
glass      AB.MO     0.026±0.003      0.003±0.001      0±0              0.275±0.034      0.246±0.061      0.268±0.051
           TC.MO     [illegible]      [illegible]      0±0              0.280±0.0??      0.252±0.059      0.2??±0.047
           AB.ECC    0.168±0.032      0.078±0.018      0±0              0.352±0.052      0.313±0.043      0.298±0.044
           TC.ECC    [illegible]      [illegible]      0±0              0.?27±0.04?      0.?02±0.0?5      0.?0?±0.04?
thyroid    AB.MO     0.003±0.001      0.001±0.000      0±0              0.006±0.001      0.006±0.001      0.006±0.002
           TC.MO     0.001±0.001      0.001±0.001      0±0              0.00?±0.001      0.006±0.001      0.006±0.001
           AB.ECC    0.006±0.001      0.002±0.001      0±0              0.010±0.002      0.008±0.002      0.005±0.001
           TC.ECC    0.000±0.001      0±0              0±0              0.006±0.002      0.006±0.002      0.004±0.000
dna        AB.MO     0.053±0.002      0.040±0.002      0.028±0.002      0.076±0.007      0.066±0.006      0.061±0.006
           TC.MO     0.052±0.004      0.039±0.003      0.029±0.002      0.078±0.007      0.064±0.006      0.054±0.005
           AB.ECC    0.070±0.004      0.049±0.005      0.017±0.004      0.089±0.008      0.077±0.009      0.069±0.005
           TC.ECC    0.059±0.005      0.041±0.006      0.028±0.005      0.083±0.008      0.070±0.006      0.065±0.004
vehicle    AB.MO     0.099±0.003      0.073±0.003      0.018±0.000      0.249±0.020      0.245±0.019      0.212±0.010
           TC.MO     0.086±0.009      0.048±0.007      0.019±0.027      0.241±0.020      0.231±0.018      0.211±0.021
           AB.ECC    0.271±0.010      0.207±0.011      0.096±0.010      0.359±0.022      0.300±0.021      0.249±0.017
           TC.ECC    0.208±0.018      0.140±0.019      0.055±0.011      0.327±0.022      0.287±0.024      0.257±0.028

In terms of test error, it is not apparent which algorithm
is better. Empirically, totally corrective boosting has a
generalization capability comparable to that of the stage-wise
version. Notably, on thyroid, dna and vehicle, where the
regularization parameter is tuned via cross-validation, our
algorithms clearly outperform their counterparts. We
conjecture that tuning this parameter more carefully could
further improve the performance of the new algorithms.

In the second experiment, we replace the base learner with
another binary classifier: Fisher's linear discriminant
function (LDA). For simplicity, we only run AdaBoost.ECC and
MultiBoost.ECC this time. All the parameters and settings are
the same as in the first experiment. The results are reported
in Table III. Again, totally corrective boosting converges
faster than the gradient descent version. We also notice that
the two ECC algorithms perform better with LDAs than with
decision stumps on svmguide2, but worse on svmguide4, glass,
thyroid and vehicle, although LDA is evidently a stronger
learner than the decision stump.

To further verify the generalization capability of our
multiclass boosting algorithms, we run the Wilcoxon rank-sum
test [33] on the test errors in Tables II and III. The
Wilcoxon rank-sum test is a nonparametric statistical tool
for assessing the hypothesis that two sets of samples are
drawn from the same distribution. If totally corrective
boosting is comparable with its counterpart in classification
error, the test should output a high significance probability.
The results are reported in Table IV. The probabilities are
high enough (> 0.8) to support this hypothesis, except for the
ECC algorithms with decision stumps when T = 50 and the ECC
algorithms with LDAs when T = 500. A closer look at those two
cases, however, shows that our totally corrective algorithms
perform better than their counterparts there.
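For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation, with average ranks for ties and no tie or continuity correction. The function names are hypothetical, and nothing here is claimed to match the software the authors used; an exact small-sample test would generally give somewhat different probabilities, so this sketch should not be expected to reproduce Table IV.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Return (z, two_sided_p) under the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here `sample_a` and `sample_b` would be the test errors of a stage-wise algorithm and of its totally corrective counterpart over the datasets; a probability close to 1 indicates that the two sets of errors are hard to tell apart.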



TABLE IV
Wilcoxon rank-sum test on classification errors

algorithms          T = 50   T = 100   T = 500
MOs with stumps      0.902     0.878     1.000
ECCs with stumps     0.743     0.821     0.983
ECCs with LDAs       0.959     1.000     0.798

TABLE III
Training and test errors (mean ± standard deviation) of AdaBoost.ECC (AB.ECC) and MultiBoost.ECC (TC.ECC). The average results
of 20 repeated tests are reported. Base learners are Fisher's linear discriminant functions.

dataset    algorithm train error 50   train error 100  train error 500  test error 50    test error 100   test error 500
svmguide2  AB.ECC    [illegible]      [illegible]      [illegible]      [illegible]      0.2?4±0.02?      [illegible]
           TC.ECC    0.031±0.013      0±0              0±0              0.228±0.044      0.229±0.024      0.221±0.027
svmguide4  AB.ECC    [illegible]      [illegible]      [illegible]      [illegible]      [illegible]      [illegible]
           TC.ECC    0.275±0.035      0.149±0.033      0±0              0.41?±0.041      0.377±0.051      0.33?±0.048
wine       AB.ECC    0±0              0±0              0±0              [illegible]      0.02?±0.0??      0.0?0±0.0?1
           TC.ECC    0±0              0±0              0±0              0.025±0.027      0.020±0.022      0.015±0.014
iris       AB.ECC    0±0              0±0              [illegible]      [illegible]      [illegible]      [illegible]
           TC.ECC    0±0              0±0              0±0              0.046±0.037      0.042±0.026      0.049±0.038
glass      AB.ECC    0.194±0.033      0.080±0.021      0±0              0.384±0.051      0.357±0.051      0.369±0.052
           TC.ECC    0.077±0.028      0.001±0.002      0±0              0.374±0.046      0.364±0.047      0.359±0.058
thyroid    AB.ECC    0.041±0.004      0.035±0.005      0.001±0.001      0.048±0.006      0.046±0.004      0.032±0.004
           TC.ECC    0.033±0.006      0.018±0.010      0±0              0.043±0.007      0.040±0.004      0.030±0.001
dna        AB.ECC    0.000±0.000      0.000±0.000      0.000±0.000      0.081±0.007      0.084±0.009      0.077±0.010
           TC.ECC    0.000±0.000      0.000±0.000      0.000±0.000      0.068±0.008      0.065±0.007      0.064±0.008
vehicle    AB.ECC    0.237±0.012      0.176±0.018      0.004±0.002      0.301±0.017      0.297±0.021      0.276±0.026
           TC.ECC    0.196±0.015      0.125±0.011      0±0              0.312±0.029      0.313±0.027      0.272±0.037

[Fig. 2, panels (a)-(e): training error versus number of iterations for AB.MO, TC.MO, AB.ECC and TC.ECC.]

Fig. 2. Training error curves of AdaBoost.MO, MultiBoost.MO, AdaBoost.ECC and MultiBoost.ECC on svmguide2, svmguide4, wine, iris, glass and vehicle.
The number of training iterations is 500. Base learners are decision stumps.

Next, we investigate the minimum margin of the training
examples, which is closely related to the generalization
error [10]. Warmuth and Rätsch [30] have proven that, by
introducing a slack variable r, totally corrective boosting
can realize a larger margin than the corrective version with
the same number of weak hypotheses. We test this conclusion
on the datasets svmguide2, svmguide4 and iris. At each
iteration, we record the minimum margin of the training
examples on the current combination of weak hypotheses. The
results are illustrated in Fig. 3. Note that the margin is
defined differently for the MO and ECC algorithms: by (23)
and (17), respectively. In every case, however, the totally
corrective boosting algorithms increase the margin much
faster than the two previous algorithms.
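Recording the minimum margin at each iteration can be sketched as follows. This is only an illustration and makes a strong simplifying assumption: it uses the familiar binary normalized margin, y_i * sum_j w_j h_j(x_i) / sum_j w_j, and not the multiclass definitions (17) and (23) of the paper. The names `min_margin` and `margin_curve`, and the layout of `predictions` (one row of +1/-1 outputs per weak hypothesis), are hypothetical.

```python
def min_margin(weights, predictions, labels):
    """Smallest normalized margin over the training examples.

    weights[j] is the coefficient of weak hypothesis j, predictions[j][i]
    is its +1/-1 output on example i, and labels[i] is the +1/-1 label.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    margins = []
    for i, label in enumerate(labels):
        score = sum(w * predictions[j][i] for j, w in enumerate(weights))
        margins.append(label * score / total)
    return min(margins)


def margin_curve(weight_history, predictions, labels):
    """Minimum margin after each iteration.

    weight_history[t] holds the coefficients of the first t+1 weak
    hypotheses after iteration t. A totally corrective method may change
    all of them at every iteration, so the full vector is stored each time.
    """
    return [min_margin(w, predictions, labels) for w in weight_history]
```

Plotting the output of `margin_curve` for a stage-wise run and a totally corrective run against the iteration index would give curves of the kind the text describes.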

[Fig. 3, panels (a)-(c): minimum margin versus number of iterations; surviving panel titles: svmguide2, svmguide4.]

Fig. 3. The minimum margin on assembled classifiers of AdaBoost.MO, MultiBoost.MO, AdaBoost.ECC and MultiBoost.ECC. The definitions of margin
are different in MOs and ECCs; however, the totally corrective algorithms clearly realize a larger margin than their counterparts within the same
number of iterations.

V. Discussion and conclusion

We have presented two boosting algorithms for multiclass
learning, which are mainly based on derivations of the
Lagrange dual problems for AdaBoost.MO and AdaBoost.ECC.
Using the column generation technique, we design new totally
corrective boosting algorithms. The two algorithms can be
formulated in a general framework based on the concept of
margins. This framework can also incorporate other multiclass
boosting algorithms, such as SAMME. In this paper, however,
we have focused on multiclass boosting with binary weak
learners.

Furthermore, we have shown that the proposed boosting
algorithms are totally corrective. This is the first time
this concept has been used in multiclass boosting. We also
discuss the reason for introducing slack variables.
Experiments on UCI datasets show that our new algorithms
converge much faster than their gradient descent
counterparts, while their classification capability is
comparable. The experimental results also demonstrate that
totally corrective algorithms can maximize the example margin
more aggressively.